cw_title.JPG

Code Walk


Problem Overview

General Approach

cw_approach.jpg


Methodology

cw_flow.JPG


Random Forest Algorithm

cw_rf.JPG

cw_rfps.jpg


XGBoost Algorithm

cw_xgb.jpg

cw_xgbps.png


Support Vector Machine Algorithm

cw_svm.png

cw_svmps.jpg


EDA

Libraries:

Import Data

Target Information

There are no missing or out-of-range values. Of the 8294 customer_IDs, 6096 (73.5%) have a label of 0 (good customer, no default) and 2198 (26.5%) have a label of 1 (bad customer, default).
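The class-balance check described above can be sketched as follows. The label table and its column names (`customer_ID`, `target`) are assumptions standing in for the real data, with a tiny toy frame so the snippet runs on its own:

```python
import pandas as pd

# Toy stand-in for the real label table: one row per customer_ID with a
# binary target (0 = good customer, 1 = default). Names are assumptions.
train_labels = pd.DataFrame({
    "customer_ID": ["a", "b", "c", "d"],
    "target": [0, 0, 0, 1],
})

# Absolute counts and percentage share of each class
counts = train_labels["target"].value_counts()
shares = train_labels["target"].value_counts(normalize=True)
print(counts)
print((shares * 100).round(1))
```

On the real data, the same two calls produce the 6096 / 2198 counts and the 73.5% / 26.5% split quoted above.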

Null Values

Categorical Variables

Continuous Variables

Duration of Customer Data


FEATURE SELECTION

During EDA, we saw that most of the candidate predictor variables contained < 10% null values. We therefore remove the variables with > 10% null values; most of the variables being removed are in fact > 90% null.
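The null-threshold rule above can be sketched with pandas. The column names here are made up; the 10% cutoff is the one stated in the text:

```python
import pandas as pd

# Toy frame standing in for the real data; column names are invented.
df = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "some_nulls": [1.0, None, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],  # 10% null
    "mostly_null": [None] * 9 + [1.0],                                   # 90% null
})

null_share = df.isna().mean()                  # fraction of nulls per column
to_drop = null_share[null_share > 0.10].index  # strictly more than 10% null
df_clean = df.drop(columns=to_drop)
print(list(df_clean.columns))
```

Note the strict inequality: a column that is exactly 10% null survives the cut.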

Feature Engineering

In order to use scikit-learn's PCA, we will drop the categorical variables as well as S_2, which is a date/time variable. These will be evaluated separately.
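The drop described above can be sketched as follows. S_2 is the date column named in the text; the other column names are illustrative stand-ins:

```python
import pandas as pd

# Toy frame; S_2 is the date/time column from the text, the rest are made up.
df = pd.DataFrame({
    "S_2": pd.to_datetime(["2018-01-01", "2018-02-01"]),
    "cat_var": ["A", "B"],     # example categorical column (name assumed)
    "num_var": [0.93, 0.88],   # example numeric column (name assumed)
})

# Select categorical columns by dtype, then drop them together with S_2,
# leaving only the numeric matrix that PCA can consume.
categorical_cols = df.select_dtypes(include=["object", "category"]).columns
numeric_df = df.drop(columns=list(categorical_cols) + ["S_2"])
print(list(numeric_df.columns))
```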


Low-Risk Goal: Stepwise Selection Model

Libraries:

Data Transformation
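PCA is scale-sensitive, so a standard transformation at this step is to standardize each feature to zero mean and unit variance. This is a sketch of that assumption using `StandardScaler`, not necessarily the exact transformation used here:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic matrix with columns on wildly different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(200, 3))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean ~0, std ~1

print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```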

Principal Components Analysis

Here we see that 23 principal components account for 95% of the variation in the data. Therefore, when we do feature selection, we will choose k = 23.
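The "smallest k reaching 95%" computation can be sketched as below. The data here is random and will not reproduce k = 23; the logic is what matters:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the scaled feature matrix (random, so k here is arbitrary)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 40))

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.argmax(cumvar >= 0.95) + 1)  # first component count reaching 95%
print(k)
```

Equivalently, `PCA(n_components=0.95)` tells scikit-learn to pick that k automatically.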

Feature Extraction

We can also visualize this:
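A cumulative-explained-variance plot of this kind can be produced as sketched below (random stand-in data; the dashed line marks the 95% threshold):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs in a script
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))  # stand-in for the scaled feature matrix

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

fig, ax = plt.subplots()
ax.plot(range(1, len(cumvar) + 1), cumvar, marker="o")
ax.axhline(0.95, linestyle="--")  # the 95% cutoff used for choosing k
ax.set_xlabel("Number of components")
ax.set_ylabel("Cumulative explained variance")
fig.savefig("cumulative_variance.png")
```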

Train-Test Split
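Given the 73/27 class imbalance noted in the EDA, a stratified split keeps both partitions representative. A sketch (the 80/20 ratio and random seed are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 5 features, ~27% positive labels like the real target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (rng.random(200) < 0.27).astype(int)

# stratify=y preserves the class ratio in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```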

Fit the Stepwise Model
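Scikit-learn has no classical p-value-based stepwise procedure, but forward selection can be approximated with `SequentialFeatureSelector` wrapped around a logistic regression. This is a sketch of that stand-in, not necessarily the exact procedure used; the real walk selects k = 23 features, while the toy example selects 4 to stay fast:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data standing in for the PCA-reduced matrix
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# Forward selection: greedily add the feature that most improves CV score
base = LogisticRegression(max_iter=1000)
selector = SequentialFeatureSelector(base, n_features_to_select=4,
                                     direction="forward", cv=3)
selector.fit(X, y)

X_selected = selector.transform(X)
model = base.fit(X_selected, y)  # refit on the selected feature subset
print(X_selected.shape)
```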

Metrics
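Typical classification metrics for this kind of model can be computed as below. The labels and predicted probabilities here are hypothetical, just to exercise the calls:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical held-out labels and model probability outputs
y_test = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.55, 0.2, 0.8, 0.6, 0.3, 0.9, 0.05])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at the 0.5 threshold

print("accuracy:", accuracy_score(y_test, y_pred))   # threshold-dependent
print("ROC AUC :", roc_auc_score(y_test, y_prob))    # threshold-free ranking
print(confusion_matrix(y_test, y_pred))              # [[TN, FP], [FN, TP]]
```

With the ~27% default rate, ROC AUC and the confusion matrix are more informative than raw accuracy, which a trivial all-zero classifier would already push above 73%.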


Medium-Risk Goal: Compare Random Forest, XGBoost, and SVM Performance


Random Forest Model

Fit the Random Forest Model
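A minimal random forest fit can be sketched as follows. The data is synthetic (with roughly the 73/27 class mix from the EDA) and the hyperparameters are illustrative defaults, not the tuned values from the walk:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with ~73% majority class, echoing the real target mix
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           weights=[0.73], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```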

Metrics


XGBoost

Libraries:

Tune the Hyperparameters

Fit the XGBoost Model


Support Vector Machine


Libraries:

Fit the SVM Model
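A minimal SVM fit can be sketched as below. Because SVMs are scale-sensitive, scaling is bundled into a pipeline; the RBF kernel and C value are illustrative defaults rather than the tuned settings from the walk:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale inside the pipeline so the test set is transformed with train statistics
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```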


Moving Forward

In progress:

High-Risk Goal: